The Bible, truth, and multilingual OCR evaluation

Authors

Tapas Kanungo

Philip Resnik

Abstract

Multilingual OCR has emerged as an important information technology, thanks to the increasing need for cross-language information access. While many research groups and companies have developed OCR algorithms for various languages, it is difficult to compare the performance of these OCR algorithms across languages. This difficulty arises because most evaluation methodologies rely on the use of a document image dataset in each of these languages, and it is difficult to find document datasets in different languages that are similar in content, layout, and fonts. In this paper we propose to use the Bible as a dataset for comparing OCR accuracy across languages. Besides being available in a wide range of languages, Bible translations are closely parallel in content, carefully translated, surprisingly relevant with respect to modern-day language, and quite inexpensive. A project at the University of Maryland is currently implementing this idea. We have created a scanned image dataset with groundtruth from an Arabic Bible. We have also used image degradation models to create synthetically degraded images of a French Bible. We hope to generate similar Bible datasets for other languages, and we are exploring alternative corpora with similar properties, such as the Koran and the Bhagavad Gita. Quantitative OCR evaluation based on the Arabic Bible dataset is currently in progress.

Similar resources

The Bible, Truth, and Multilingual OCR

Multilingual OCR has emerged as an important information technology, thanks to the increasing need for cross-language information access. While many research groups and companies have developed OCR algorithms for various languages, it is difficult to compare the performance of these OCR algorithms across languages. This difficulty arises because most evaluation methodologies rely on the use of a do...

Evaluating SEE - A Benchmarking System for Document Page Segmentation

The decomposition of a document into segments such as text regions and graphics is a significant part of the document analysis process. The basic requirement for rating and improvement of page segmentation algorithms is systematic evaluation. The approaches known from the literature have the disadvantage that manually generated reference data (zoning ground truth) are needed for the evaluation ...

Title of Thesis : GROUNDTRUTH GENERATION AND DOCUMENT IMAGE DEGRADATION

Title of Thesis: GROUNDTRUTH GENERATION AND DOCUMENT IMAGE DEGRADATION Gang Zi, Master of Science, 2005 Thesis Directed By: Professor Rama Chellappa Department of Electrical and Computer Engineering University of Maryland at College Park The problem of generating synthetic data for the training and evaluation of document analysis systems has been widely addressed in recent years. With the incre...

Reducing OCR Errors by Combining Two OCR Systems

This paper describes our efforts in building a heritage corpus of Alpine texts. We have already digitized the yearbooks of the Swiss Alpine Club from 1864 until 1982. This corpus poses special challenges since the yearbooks are multilingual and vary in orthography and layout. We discuss methods to improve OCR performance and experiment with combining two different OCR programs with the goal to ...

An OCR Free Method for Word Spotting in Printed Documents: the Evaluation of Different Feature Sets

An OCR-free word spotting method is developed and evaluated under a strong experimental protocol. Different feature sets are evaluated under the same experimental conditions. In addition, a tuning process in the document segmentation step is proposed which provides a significant reduction in terms of processing time. For this purpose, a complete OCR-free method for word spotting in printed docum...

Publication date: 1999
